Skip to content

Implement instructions for matching builtin character classes and assertions #547

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 27 commits into from
Aug 3, 2022

Conversation

rctcwyvrn
Copy link
Contributor

@rctcwyvrn rctcwyvrn commented Jul 5, 2022

  • Adds matchBuiltin to match builtins instead of a consumeFn calling _CharacterClassModel.matches in most cases
  • Removes AssertionFunction and changes assertBy to match in Processor instead

There are some gains (10% in microbenchmarks on builtin character classes) but the main purpose is to set up for future work

Future work:

  • The layout of the payloads for matchBuiltin and assertBy will be optimized (hopefully soon) along with the rest of the instruction layout. Changes made in this branch and the scalar optimization branch showed that this is a strong next step
  • I wasn't able to fully nuke _CharacterClassModel.matches but I think we're getting very close to being able to rip out most of it and the consumer interface for CustomCharacterClass

@rctcwyvrn rctcwyvrn marked this pull request as draft July 5, 2022 23:51
@rctcwyvrn
Copy link
Contributor Author

rctcwyvrn commented Jul 6, 2022

WIP because I need to figure out these weird performance behaviours

Weird thing number 1: The first commit in this branch (copying over the AsciiBitset changes from the matchScalar PR) is resulting in significant speedups

=== Improvements =====================================================================
- emailLookaheadNoMatchesAll              58ms	63ms	-5.05ms		-8.0%
- emailLookaheadNoMatchesFirst            58ms	63ms	-4.95ms		-7.9%
- emailLookaheadAll                       81.9ms	86.8ms	-4.95ms		-5.7%
- basicCCC                                9.63ms	10.7ms	-1.08ms		-10.1%
- caseInsensitiveCCC                      11.3ms	12.3ms	-1.05ms		-8.6%
- invertedCCC                             28.5ms	29.4ms	-946µs		-3.2%
- emailRFCAll                             50.2ms	51.1ms	-942µs		-1.8%
- basicRangeCCC                           10.6ms	11.4ms	-782µs		-6.9%

I sanity checked this by trying to minimize the change (not copying it over to a new file, not adding the new scalar api) and got the same result.

diff --git a/Sources/_StringProcessing/ConsumerInterface.swift b/Sources/_StringProcessing/ConsumerInterface.swift
index af46b53..9c41002 100644
--- a/Sources/_StringProcessing/ConsumerInterface.swift
+++ b/Sources/_StringProcessing/ConsumerInterface.swift
@@ -11,6 +11,13 @@
 
 @_implementationOnly import _RegexParser
 
+extension Character {
+   var _singleScalarAsciiValue: UInt8? {
+     guard self != "\r\n" else { return nil }
+     return asciiValue
+   }
+ }
+
before
      internal func matches(char: Character) -> Bool {
        let ret: Bool
        if let val = char.asciiValue {
          if val < 64 {
            ret = (a >> val) & 1 == 1
          } else {
            ret =  (b >> (val - 64)) & 1 == 1
          }
        } else {
          ret = false
        }

        if isInverted {
          return !ret
        }

        return ret
      }
after
      private func matches(_ val: UInt8) -> Bool {
        if val < 64 {
          return (a >> val) & 1 == 1
        } else {
          return (b >> (val - 64)) & 1 == 1
        }
      }

       internal func matches(char: Character) -> Bool {
         let matched: Bool
         if let val = char._singleScalarAsciiValue {
           matched = matches(val)
         } else {
           matched = false
         }

         if isInverted {
           return !matched
         }
         return matched
       }

Some quick profiling seems to indicate that the performance increase came the call to _singleScalarAsciiValue being faster than the call to asciiValue, even though _singleScalarAsciiValue also uses asciiValue??? My best guess is that it has to do with Character.asciiValue.getter being inlined before the change but not after, though I honestly don't know why that would make a difference

@rctcwyvrn
Copy link
Contributor Author

rctcwyvrn commented Jul 6, 2022

Weird performance thing number 2: The optimization is causing regressions in regexes that do not produce the new instruction and very little improvement in the ones that do

Comparing with the first commit on the branch we get larger regressions in all of the email regexes and smaller ones in almost every other result

Comparing against benchmark result file sanity_check_after_bitset.json
=== Regressions ======================================================================
- htmlFirst                               26.9µs	26.7µs	209ns		0.8%
- emailLookaheadFirst                     18.5µs	18.2µs	250ns		1.4%
- emailRFCFirst                           8.38µs	7.92µs	458ns		5.8%
- cssFirst                                93µs	91.9µs	1.17µs		1.3%
- HangulSyllableFirst                     4.06ms	4.01ms	47.8µs		1.2%
- cssAll                                  4.87ms	4.82ms	47.9µs		1.0%
- caseInsensitiveCCC                      11.5ms	11.4ms	107µs		0.9%
- ReluctantQuantWithTerminalWhole         10.6ms	10.4ms	113µs		1.1%
- basicRangeCCC                           10.7ms	10.5ms	126µs		1.2%
- HangulSyllableAll                       8.41ms	8.24ms	167µs		2.0%
- htmlAll                                 13.5ms	13.3ms	205µs		1.5%
- subtractionCCC                          15.3ms	15.1ms	278µs		1.8%
- intersectionCCC                         15.8ms	15.5ms	278µs		1.8%
- notFoundAll                             15.1ms	14.8ms	285µs		1.9%
- notFoundFirst                           15.1ms	14.8ms	298µs		2.0%
- symDiffCCC                              39.2ms	38.3ms	885µs		2.3%
- invertedCCC                             29.5ms	28.6ms	937µs		3.3%
- emailLookaheadNoMatchesFirst            59.5ms	58.3ms	1.12ms		1.9%
- emailLookaheadNoMatchesAll              59.3ms	58.2ms	1.13ms		1.9%
- emailRFCAll                             51.5ms	50ms	1.53ms		3.1%
- emailLookaheadAll                       83.5ms	81.8ms	1.72ms		2.1%
- emailRFCNoMatchesAll                    110ms	107ms	3.03ms		2.8%
- emailRFCNoMatchesFirst                  110ms	107ms	3.06ms		2.8%
=== Improvements =====================================================================
- GraphemeBreakNoCapAll                   9.2ms	9.45ms	-257µs		-2.7%
- EagarQuantWithTerminalWhole             6.67ms	6.76ms	-83.9µs		-1.2%
- GraphemeBreakNoCapFirst                 79.2µs	82.8µs	-3.54µs		-4.3%
  • All tests are passing so I don't think it's because there's a bug somewhere
  • Checking the compilation does confirm that neither email regex is generating the new instruction
  • The current benchmark suite doesn't have many benchmarks that mostly use builtin character classes, most mostly use custom classes and only use the builtins for whitespace. Need to add new microbenchmarks and real world benchmarks that use builtins.

@rctcwyvrn
Copy link
Contributor Author

rctcwyvrn commented Jul 6, 2022

Removing the remnants of the bitset fast path optimization I was testing seems to have reverted most of the regressions

Comparing against benchmark result file sanity_check_after_bitset.json
=== Regressions ======================================================================
- emailRFCFirst                           8.17µs	7.92µs	249ns		3.1%
- invertedCCC                             28.8ms	28.6ms	167µs		0.6%
- emailLookaheadAll                       82.3ms	81.8ms	446µs		0.5%
- emailRFCAll                             50.5ms	50ms	528µs		1.1%
=== Improvements =====================================================================
- GraphemeBreakNoCapAll                   9.02ms	9.45ms	-430µs		-4.6%
- caseInsensitiveCCC                      11ms	11.4ms	-375µs		-3.3%
- basicCCC                                9.54ms	9.9ms	-359µs		-3.6%
- ReluctantQuantWhole                     14ms	14.2ms	-238µs		-1.7%
- EagarQuantWithTerminalWhole             6.54ms	6.76ms	-216µs		-3.2%
- ReluctantQuantWithTerminalWhole         10.3ms	10.4ms	-184µs		-1.8%
- htmlAll                                 13.1ms	13.3ms	-137µs		-1.0%
- intersectionCCC                         15.4ms	15.5ms	-121µs		-0.8%
- basicRangeCCC                           10.5ms	10.5ms	-77.5µs		-0.7%
- cssAll                                  4.74ms	4.82ms	-74µs		-1.5%
- GraphemeBreakNoCapFirst                 78.5µs	82.8µs	-4.25µs		-5.1%
- cssFirst                                90.8µs	91.9µs	-1.13µs		-1.2%
- htmlFirst                               26.4µs	26.7µs	-291ns		-1.1%

But why?

  • Why did an extra computed property on BuiltinCC and an extra computation when emitting a matchBuiltin affect all these other benchmarks????
  • Why are the email regexes still regressing?? (albeit only a little bit now, it could just be variance from my machine)
  • Why do we now have improvements on regexes that don't have builtins (all the CCC and reluctant quant benchmarks) ???

Need to make 100% sure that compile time is not affecting the benchmark times.

A lot of this weirdness could also just be due to inaccurate timing from the benchmarker and variance because I'm running everything locally

@rctcwyvrn rctcwyvrn changed the title [WIP] Implement instructions for matching builtin character classes Implement instructions for matching builtin character classes and assertions Jul 12, 2022
@rctcwyvrn rctcwyvrn marked this pull request as ready for review July 12, 2022 01:04
@rctcwyvrn
Copy link
Contributor Author

@swift-ci test

case .verticalWhitespace:
matched = c.isNewline && (c.isASCII || !isStrictAscii)
case .newlineSequence:
matched = c.isNewline && (c.isASCII || !isStrictAscii)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was carried over from the old code, but why does this match a full \r\n sequence when in unicode scalars mode? Should we change that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's what is specified in the Unicode proposal:

Additionally, \R and CharacterClass.newline provide a way to include the CR+LF pair, even when matching with Unicode scalar semantics.

I don't know the exact rationale behind it, but I would assume it's done in an effort to be compatible with other regex engines that match scalars by default but with \R still matching \r\n.

return currentPosition == subjectBounds.upperBound
}

case .wordBoundary:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There seem to be no test cases that use the usesSimpleUnicodeBoundaries option, should we be testing that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We definitely should be.

@Azoy?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a couple: testSimpleWordBoundaries and https://github.com/apple/swift-experimental-string-processing/blob/96fb215facf425226ca0645d5cab6e84a3befece/Tests/RegexBuilderTests/RegexDSLTests.swift#L540 but the latter was xfail, I have to fix the test because the test is wrong.

@rctcwyvrn rctcwyvrn requested a review from milseman July 12, 2022 01:16
@rctcwyvrn
Copy link
Contributor Author

@swift-ci test

switch semanticLevel {
case .graphemeCluster:
return input.index(after: currentPosition) == subjectBounds.upperBound
&& input[currentPosition].isNewline
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CC @Azoy , these are more examples of a consume operation that we want for more efficient access to String. Something like func nom() -> (current: Character, next: Index)

return currentPosition == subjectBounds.upperBound
}

case .wordBoundary:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We definitely should be.

@Azoy?

@rctcwyvrn
Copy link
Contributor Author

I may have gotten a little carried away while updating this branch to match main (internalization of _CharacterClassModel and changes to assertions) and I ended up implementing a new DSLTree node for character classes

It seems to all be working but there's a slight hiccup in inverting character classes that I'll work out on Monday

@rctcwyvrn
Copy link
Contributor Author

@swift-ci test

- matchBuiltin always fails if at endIndex
- fix switch in isStrictAscii
@rctcwyvrn
Copy link
Contributor Author

@swift-ci test

Copy link
Member

@milseman milseman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In progress review, but plenty to communicate. Looks great! Lots of little tweaks and clarifications but this is awesome stuff!

@@ -162,6 +162,8 @@ extension DSLTree.Atom {
case .assertion:
// TODO: We could handle, should this be total?
return nil
case .characterClass(let cc):
return cc.generateConsumer(opts)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Future: get rid of all these consumers except custom ones like custom character classes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main reason why all these consumer functions + _CharacterClassModel.matches still exist is because we have to emit any non-ascii CustomCharacterClass as one big consumer so it needs to know how to create a consumer for each possible member type (which includes Atom).

Do you think we could emit it as an ordered choice over its members instead? I think the main thing to add would be instructions for handling the custom character class set operations, and if we had that then we could rip out all of the non character property related stuff in ConsumerInterface

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to avoid emitting backtracking traffic for custom character classes if we can. However, since there's a compilation boundary around each member anyways, that might not account for much. Let's keep this for now and try to come back to it again

Copy link
Contributor Author

@rctcwyvrn rctcwyvrn Jul 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another thing to consider is that if we keep non-ascii custom character classes as one big consumer we're unable to access any optimized matching for any of its members.

Also just no longer having all the matching logic exactly duplicated across ConsumerInterface and Processor seems like something pretty important for future extensibility/preventing bugs. We pay some cost for emitting backtracking traffic but this is mostly for uncommon cases, the most common ascii cases are already handled by the bitset.

}

func isAtEndOfLine(_ payload: AssertionPayload) -> Bool {
if currentPosition == subjectBounds.upperBound { return true }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... similarly non-aligned upper bounds.

switch payload.semanticLevel {
case .graphemeCluster:
return input.index(after: currentPosition) == subjectBounds.upperBound
&& input[currentPosition].isNewline
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, alignment?

case .endOfSubject: return currentPosition == subjectBounds.upperBound

case .resetStartOfMatch:
fatalError("Unreachable, we should have thrown an error during compilation")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Future: we'll want to version support so that we can reason about back deployment

- static vars in payloads
- Clean up _CharacterClassModel
- Use the model for bytecodegen and consumer interface
- Merge the grapheme and scalar match builtin cases together
@rctcwyvrn
Copy link
Contributor Author

@swift-ci test

@rctcwyvrn rctcwyvrn merged commit 405fbcb into swiftlang:main Aug 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants